Using Inverted Files to Compress Text

نویسنده

  • Strahil Ristov
چکیده

This is the first report on a new approach to text compression. It consists of representing the text file with compressed inverted file index in conjunction with very compact lexicon, where lexicon includes every word in the text. The index is compressed using standard index compression techniques, and lexicon is compressed by original dictionary compression method that gives better compression results than existing procedures. Compression procedure is complex, but decompression time is linear with the file size, although it requires two passes and hence can not be performedonline. First experiments show that this method, when refined, can be competitive for larger texts that only need to be decompressed in the real time.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Parallel methods for the update of partitioned inverted files

Purpose – An issue which tends to be ignored in information retrieval is the issue of updating inverted files. This is largely because inverted files were devised to provide fast query service, and much work has been done with the emphasis strongly on queries. In this paper we study the effect of using parallel methods for the update of inverted files in order to reduce costs, by looking at two...

متن کامل

Fast Text Access Methods for Optical and Large Magnetic Disks: Design and Performance Comparison

High capacity disks, especially optical ones, are commercially available. These disks are ideal for archiving large text data bases. In this work, we examine efficient searching techniques for such applications. We propose a unifying framework, which reveals the similarities between signature files and an inverted file using a hash table. Then, we design methods that combine the ease of inserti...

متن کامل

Fast Text Access Methods for Optical and Large Magnetic Disks: Designs and Performance Comparison

High capacity disks, especially optical ones, are commercially available. These disks are ideal for archiving large text data bases. In this work, we examine efficient searching techniques for such applications. We propose a unifying framework, which reveals the similarities between signature files and an inverted file using a hash table. Then, we design methods that combine the ease of inserti...

متن کامل

Using Additional Indexes for Fast Full-Text Search of Phrases That Contains Frequently Used Words

Searches for phrases and word sets in large text arrays by means of additional indexes are considered. Their use may reduce the query-processing time by an order of magnitude in comparison with standard inverted files.

متن کامل

Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files

There are many advantages to be gained by storing the lexicon of a full text database in main memory. In this paper we describe how to use a compressed inverted file index to search such a lexicon for entries that match a pattern or partially specified term. This method provides an effective compromise between speed and space, running orders of magnitude faster than brute force search, but requ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004